Introduction to Network Analysis - Application (Python)
Contents
Introduction to Network Analysis - Application (Python)#
Roman Jurowetzki
!pip install -U python-louvain
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
Network data structures#
Below an example of a minimal edge list created by zipping together two lists into a list of tuples.
In this case, let us assume this network to be unweighted, meaning a connection can be eiter tresent or absent.
origin = [1, 2, 2, 1, 4]
target = [2, 3, 4, 5, 1]
edge_list = list(zip(origin, target))
print(edge_list)
[(1, 2), (2, 3), (2, 4), (1, 5), (4, 1)]
This can of cause also be a DataFrame where columns indicate origin and target
df = pd.DataFrame({'origin':origin, 'target':target})
df
| origin | target | |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 3 |
| 2 | 2 | 4 |
| 3 | 1 | 5 |
| 4 | 4 | 1 |
The NetowrkX library allows us to do many network-related things in Python. While there are other packages for that, NetworkX remains the most popular choice.
A network can be initiated from a list of tuples (more basic) from numpy arrays or from DataFrames
G = nx.Graph()
G.add_edges_from(edge_list)
G_from_df = nx.from_pandas_edgelist(df, source = 'origin', target = 'target')
print(G.nodes() == G_from_df.nodes())
print(G.edges() == G_from_df.edges())
True
True
Adjacency Matrix#
A second popular form of network representation is the adjacency-matrix (also called socio-matrix).
It is represented as a \(n*n\) matrix, where \(n\) stands for the number of elements of which their relationships should be represented.
The value in the cell that intercepts row \(n\) and column \(m\) indicates if an edge is present (=1) or absent (=0).
Tip: Given an edgelist, an adjacency matrix can easily be produced by crosstabulating:
df_cros = pd.crosstab(df.origin, df.target)
df_cros
| target | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| origin | |||||
| 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
idx = df_cros.columns.union(df_cros.index)
df_adj_asy = df_cros.reindex(index = idx, columns=idx, fill_value=0)
df_adj_asy
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 |
nx.to_pandas_adjacency(G)
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 2 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
df_adj_asy + df_adj_asy.T
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 |
from scipy import sparse
sparse_df = sparse.csr_matrix(df_adj_asy)
print(sparse_df)
(0, 1) 1
(0, 4) 1
(1, 2) 1
(1, 3) 1
(3, 0) 1
edge_list
[(1, 2), (2, 3), (2, 4), (1, 5), (4, 1)]
Nodelists#
Edgelists as well as adjacency matrices only stores connectivity pattern between nodes, but due to their structure cannot store informations on the nodes in which we might be interested.
Therefore, we in many cases also provide a a node list with these informations (such as the names of the nodes or any kind of groupings).
name = ["Jesper", "Pernille", "Jacob", "Dorte", "Donald"]
sex = ["M", "F", "M", "F", "M"]
group = ["A", "B", "B", "A", "C"]
node_list = pd.DataFrame({'name':name, 'sex':sex, 'group':group})
node_list
| name | sex | group | |
|---|---|---|---|
| 0 | Jesper | M | A |
| 1 | Pernille | F | B |
| 2 | Jacob | M | B |
| 3 | Dorte | F | A |
| 4 | Donald | M | C |
NetworkX Graphs#
We have already seen a sneak-preview of NetworkX functionality
There are many weays to create various types of Graph objects in NetworkX and I’d advise you to have a look at the documentation
Let’s combine the data from above and create a graph. While it is easy to create a graph from an edgelist it cannot be inferred from a nodelist directly.
dict(enumerate(name))
{0: 'Jesper', 1: 'Pernille', 2: 'Jacob', 3: 'Dorte', 4: 'Donald'}
mapper = dict(enumerate(name, 1))
print(mapper)
{1: 'Jesper', 2: 'Pernille', 3: 'Jacob', 4: 'Dorte', 5: 'Donald'}
print([mapper[x] for x in origin])
origin = [mapper[x] for x in origin]
target = [mapper[x] for x in target]
['Jesper', 'Pernille', 'Pernille', 'Jesper', 'Dorte']
edge_list = list(zip(origin, target))
print(edge_list)
[('Jesper', 'Pernille'), ('Pernille', 'Jacob'), ('Pernille', 'Dorte'), ('Jesper', 'Donald'), ('Dorte', 'Jesper')]
G = nx.Graph()
G.add_edges_from(edge_list)
node_attr = node_list.set_index('name').to_dict('index')
node_attr
{'Jesper': {'sex': 'M', 'group': 'A'},
'Pernille': {'sex': 'F', 'group': 'B'},
'Jacob': {'sex': 'M', 'group': 'B'},
'Dorte': {'sex': 'F', 'group': 'A'},
'Donald': {'sex': 'M', 'group': 'C'}}
nx.set_node_attributes(G, node_attr)
G.nodes(data=True)
NodeDataView({'Jesper': {'sex': 'M', 'group': 'A'}, 'Pernille': {'sex': 'F', 'group': 'B'}, 'Jacob': {'sex': 'M', 'group': 'B'}, 'Dorte': {'sex': 'F', 'group': 'A'}, 'Donald': {'sex': 'M', 'group': 'C'}})
G.edges()
EdgeView([('Jesper', 'Pernille'), ('Jesper', 'Donald'), ('Jesper', 'Dorte'), ('Pernille', 'Jacob'), ('Pernille', 'Dorte')])
len(G.nodes())
5
# Not pretty at all
plt.figure(figsize=(5,5))
nx.draw_kamada_kawai(G, with_labels = True)
G['Jacob']
AtlasView({'Pernille': {}})
G['Jesper']
AtlasView({'Pernille': {}, 'Donald': {}, 'Dorte': {}})
G.nodes()['Jesper']
{'sex': 'M', 'group': 'A'}
[node for node, data in G.nodes(data=True) if data['sex'] == 'F']
['Pernille', 'Dorte']
node_list[node_list.sex == 'F'].name
1 Pernille
3 Dorte
Name: name, dtype: object
G_sub = nx.subgraph(G, node_list[node_list.sex == 'F'].name)
pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
| sex | group | |
|---|---|---|
| Jesper | M | A |
| Pernille | F | B |
| Jacob | M | B |
| Dorte | F | A |
| Donald | M | C |
Network analysis and measures#
# Generate sample small world network
g = nx.watts_strogatz_graph(200, 3 , 5)
# Quick visualization
nx.draw(g, with_labels = False, node_size=10)
Node level measures#
Centralities can be easily created on node level various centrality algorithms.
centrality_dgr = nx.degree_centrality(g)
centrality_eigen = nx.eigenvector_centrality_numpy(g)
centrality_between = nx.betweenness_centrality(g)
nd_attrb_df = pd.DataFrame({'centrality_dgr':centrality_dgr,
'centrality_eigen':centrality_eigen,
'centrality_between':centrality_between})
nd_attrb = nd_attrb_df.to_dict('index')
nx.set_node_attributes(g, nd_attrb)
# Degree centrality
nx.draw(g, with_labels = False, node_size=[v * 5000 for v in centrality_dgr.values()])
# Eigenvalue centrality
nx.draw(g, with_labels = False, node_size=[v * 500 for v in centrality_eigen.values()])
# Betweenness centrality
nx.draw(g, with_labels = False, node_size=[v * 500 for v in centrality_between.values()])
Clustering (Community detection)#
Clustering algorithms are similar to UML algorithms in traditional ML - there are many diffrent approaches and the data structure is often the determinant of what is makes sense and should be used.
While there are several community detection algorithms in the NetworkX library, one of the most known and used - the Louvain algorithm is maintained as an own package (http://python-louvain.readthedocs.io/)
It is used in a very similar fashion to other functions in NetworkX and returns similar objects.
G = nx.random_partition_graph(np.random.randint(15,18,5), 0.75, 0.05)
nx.draw_kamada_kawai(G, node_size=10)
import community as community_louvain
partition = community_louvain.best_partition(G)
nx.set_node_attributes(G, partition, 'partition')
nx.draw_kamada_kawai(G, node_color=list(partition.values()))
Neighborhood of a Node#
Lets check the size of all nodes neighborhood at distance 2.
g1 = nx.ego_graph(G, 1, radius=2)
g50 = nx.ego_graph(G, 50, radius=2)
g1_df = pd.DataFrame.from_dict(dict(g1.nodes(data=True)), orient='index')
g50_df = pd.DataFrame.from_dict(dict(g50.nodes(data=True)), orient='index')
nx.draw_kamada_kawai(g1, node_color=g1_df.partition)
nx.draw_kamada_kawai(g50, node_color=g50_df.partition)
(Global) Network structure#
Finally, it is often also informative to look at the overal characteristics of the network. We will do this in more detail next time, but just so you know:
The density of a measure represents the share of all connected to all possible connections in the network
nx.density(G)
0.18157181571815717
Transistivity, also called the Clustering Cefficient indicates how much the network tends to be locally clustered.
That is measured by the share of closed triplets. Again, we will dig into that next time.
nx.transitivity(G)
0.49492924528301885
The diameter is the longest of the shortest paths between two nodes of the network.
nx.diameter(G)
3
Finally, the mean distance, or average path lenght represents the mean of all shortest paths between all nodes. It is a measure of diffusion potential within a network.
nx.average_shortest_path_length(G)
2.0689551339957846
Case: Networks are coming…#

So, lets get serious. Appropriate for the weather these days in Denmark, the theme is “winter is comming…”.
Therefore, we will have some fun analysing the Game of Thrones data provided by Andrew Beveridge.
It is a Character Interaction Networks for George R. R. Martin’s “A Song of Ice and Fire” saga (yes, we are talking about the books…).
These networks were created by connecting two characters whenever their names (or nicknames) appeared within 15 words of one another in one of the books in “A Song of Ice and Fire.”
The edge weight corresponds to the number of interactions.
This is a nice skill you will have after the second part of M2 on your own.
Build the graph#
First, we load all nodes, representing all characters appearing in the books:
edges = pd.read_csv('https://sds-aau.github.io/SDS-master/00_data/GoT_network/asoiaf-all-edges.csv')
edges.head()
| Source | Target | Type | id | weight | |
|---|---|---|---|---|---|
| 0 | Addam-Marbrand | Brynden-Tully | Undirected | 0 | 3 |
| 1 | Addam-Marbrand | Cersei-Lannister | Undirected | 1 | 3 |
| 2 | Addam-Marbrand | Gyles-Rosby | Undirected | 2 | 3 |
| 3 | Addam-Marbrand | Jaime-Lannister | Undirected | 3 | 14 |
| 4 | Addam-Marbrand | Jalabhar-Xho | Undirected | 4 | 3 |
edges.columns = [w.lower() for w in edges.columns]
So, that’s what we have, a classical edgelist, with id1 in column 1 and id2 in column2.
Note, the edges are in this case weighted.
Ok, lets see how many characters we have overall.
pd.concat([edges['source'],edges['target']], axis=0).nunique()
796
len(set(edges['source']) | set(edges['target']))
796
Because there are so many characters in the books, many of them minor,
I am subsetting the data to the 100 characters with the most interactions across all books.
The edges are undirected, therefore there are no redundant Source-Target combinations.
Because of this, I pivot Source and Target data before summing up the weights.
edges_melt = pd.melt(edges, id_vars=['id','weight'], value_vars=['source', 'target'],
var_name='role', value_name='name')
edges_melt.head()
| id | weight | role | name | |
|---|---|---|---|---|
| 0 | 0 | 3 | source | Addam-Marbrand |
| 1 | 1 | 3 | source | Addam-Marbrand |
| 2 | 2 | 3 | source | Addam-Marbrand |
| 3 | 3 | 14 | source | Addam-Marbrand |
| 4 | 4 | 3 | source | Addam-Marbrand |
edges_melt.groupby('name').weight.sum().sort_values(ascending=False)
name
Tyrion-Lannister 2873
Jon-Snow 2757
Cersei-Lannister 2232
Joffrey-Baratheon 1762
Eddard-Stark 1649
...
Garth-Tyrell 3
Hugh 3
Old-Bill-Bone 3
Owen-Merryweather 3
Torwold-Browntooth 3
Name: weight, Length: 796, dtype: int64
chars_main = edges_melt.groupby('name').weight.sum().sort_values(ascending=False).index[:100]
So far so good, if we only go by edge weights,
Lets reduce our edgelist to this main characters, just to warm up and keep the overview.
edges = edges[(edges.source.isin(chars_main)) & (edges.target.isin(chars_main))]
Now we can convert that into a NetowkrX graph
g = nx.from_pandas_edgelist(edges, source='source', target='target', edge_attr='weight')
# no need to filter for parallel edges and seems there are no isolates (degree 0 nodes)
list(nx.isolates(g))
[]
g.edges(data=True)
weights = [z['weight'] for x,y,z in g.edges(data=True)]
sns.displot(weights)
<seaborn.axisgrid.FacetGrid at 0x7fd98c46bd50>
We see a right skewed distribution with many weak and some very strong edges. Lets take a look what are the edges with the highest weight (meaning here: the characters with most intraction).
centrality_dgr = nx.degree_centrality(g)
centrality_eigen = nx.eigenvector_centrality_numpy(g, weight='weight')
centrality_between = nx.betweenness_centrality(g, weight='weight')
nx.set_node_attributes(g, centrality_dgr, 'centrality_dgr')
nx.set_node_attributes(g, centrality_eigen, 'centrality_eigen')
nx.set_node_attributes(g, centrality_between, 'centrality_between')
g.nodes(data=True)['Tyrion-Lannister']
{'centrality_dgr': 0.5454545454545455,
'centrality_eigen': 0.3789114457918316,
'centrality_between': 0.03467326324469181}
g.degree(weight='weight')['Tyrion-Lannister']
2170
Communities & Groups#
partition = community_louvain.best_partition(g)
nx.set_node_attributes(g, partition, 'community')
nodes_df = pd.DataFrame.from_dict(dict(g.nodes(data=True)),orient='index')
nodes_df.groupby('community')['centrality_eigen'].nlargest(5)
community
0 Jon-Snow 0.165477
Samwell-Tarly 0.059171
Jeor-Mormont 0.048872
Janos-Slynt 0.036871
Aemon-Targaryen-(Maester-Aemon) 0.031288
1 Robert-Baratheon 0.287403
Eddard-Stark 0.286136
Stannis-Baratheon 0.152189
Varys 0.139816
Petyr-Baelish 0.136667
2 Barristan-Selmy 0.048767
Daenerys-Targaryen 0.030154
Jorah-Mormont 0.020294
Viserys-Targaryen 0.009412
Drogo 0.005765
3 Robb-Stark 0.174454
Catelyn-Stark 0.165325
Bran-Stark 0.127022
Lysa-Arryn 0.057502
Rickon-Stark 0.048728
4 Tyrion-Lannister 0.378911
Cersei-Lannister 0.360116
Joffrey-Baratheon 0.346213
Sansa-Stark 0.276120
Jaime-Lannister 0.215525
Name: centrality_eigen, dtype: float64
Network Visualization I#
Ok, lets give it a first minimal shot:
nx.draw(g, with_labels = True, node_size=10)
!pip install -qq holoviews
!pip install -qq -U bokeh
!pip install -qq datashader
|████████████████████████████████| 18.5 MB 49.2 MB/s
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
panel 0.12.1 requires bokeh<2.4.0,>=2.3.0, but you have bokeh 2.4.3 which is incompatible.
|████████████████████████████████| 18.2 MB 782 kB/s
|████████████████████████████████| 76 kB 5.3 MB/s
?25h Building wheel for datashape (setup.py) ... ?25l?25hdone
# Import the libraries and link to the bokeh backend
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
from bokeh.plotting import show
kwargs = dict(width=800, height=800, xaxis=None, yaxis=None)
opts.defaults(opts.Nodes(**kwargs), opts.Graph(**kwargs))
# Create and save a layout.
g_layout = nx.layout.spring_layout(g)
g_plot = hv.Graph.from_networkx(g, g_layout).opts(tools=['hover'], node_color='community')
labels = hv.Labels(g_plot.nodes, ['x', 'y'], 'index')
from holoviews.operation.datashader import datashade, bundle_graph
bundled = bundle_graph(g_plot)
show(hv.render(bundled * labels.opts(text_font_size='6pt', text_color='white', bgcolor='gray')))